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Abstract 


Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in 
statistical learning of latent variable models and in data mining. In this paper, 
we propose fast and randomized tensor CP decomposition algorithms based on 
sketching. We build on the idea of count sketches, but introduce many novel ideas 
which are unique to tensors. We develop novel methods for randomized com¬ 
putation of tensor contractions via FFTs, without explicitly forming the tensors. 

Such tensor contractions are encountered in decomposition methods such as ten¬ 
sor power iterations and alternating least squares. We also design novel colliding 
hashes for symmetric tensors to further save time in computing the sketches. We 
then combine these sketching ideas with existing whitening and tensor power iter¬ 
ative techniques to obtain the fastest algorithm on both sparse and dense tensors. 

The quality of approximation under our method does not depend on properties 
such as sparsity, uniformity of elements, etc. We apply the method for topic mod¬ 
eling and obtain competitive results. 

Keywords: Tensor CP decomposition, count sketch, randomized methods, spec¬ 
tral methods, topic modeling 

1 Introduction 

In many data-rich domains such as computer vision, neuroscience and social networks consisting 
of multi-modal and multi-relational data, tensors have emerged as a powerful paradigm for han¬ 
dling the data deluge. An important operation with tensor data is its decomposition, where the 
input tensor is decomposed into a succinct form. One of the popular decomposition methods is the 
CANDECOMP/PARAFAC (CP) decomposition, also known as canonical polyadic decomposition 
mm, where the input tensor is decomposed into a succinct sum of rank-1 components. The CP 
decomposition has found numerous applications in data mining (4l IT8l l20l . computational neuro¬ 
science USEE!, and recently, in statistical learning for latent variable models 111 [30] [28] [6). For 
latent variable modeling, these methods yield consistent estimates under mild conditions such as 
non-degeneracy and require only polynomial sample and computational complexity lUI f30l [28l l6l . 

Given the importance of tensor methods for large-scale machine learning, there has been an in¬ 
creasing interest in scaling up tensor decomposition algorithms to handle gigantic real-world data 
tensors IE] El] [8] ED [II El E2- However, the previous works fall short in many ways, as described 
subsequently. In this paper, we design and analyze efficient randomized tensor methods using ideas 
from sketching (23). The idea is to maintain a low-dimensional sketch of an input tensor and then 
perform implicit tensor decomposition using existing methods such as tensor power updates, alter¬ 
nating least squares or online tensor updates. We obtain the fastest decomposition methods for both 
sparse and dense tensors. Our framework can easily handle modern machine learning applications 
with billions of training instances, and at the same time, comes with attractive theoretical guarantees. 
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Our main contributions are as follows: 


Efficient tensor sketch construction: We propose efficient construction of tensor sketches when 
the input tensor is available in factored forms such as in the case of empirical moment tensors, where 
the factor components correspond to rank-1 tensors over individual data samples. We construct 
the tensor sketch via efficient FFT operations on the component vectors. Sketching each rank-1 
component takes 0(n + b log b) operations where n is the tensor dimension and b is the sketch 
length. This is much faster than the 0{n p ) complexity for brute force computations of a pth-order 
tensor. Since empirical moment tensors are available in the factored form with N components, 
where N is the number of samples, it takes 0((n + b log b)N) operations to compute the sketch. 

Implicit tensor contraction computations: Almost all tensor manipulations can be expressed in 
terms of tensor contractions, which involves multilinear combinations of different tensor fibres ED- 
For example, tensor decomposition methods such as tensor power iterations, alternating least squares 
(ALS), whitening and online tensor methods all involve tensor contractions. We propose a highly 
efficient method to directly compute the tensor contractions without forming the input tensor ex¬ 
plicitly. In particular, given the sketch of a tensor, each tensor contraction can be computed in 
0{n + b log b) operations, regardless of order of the source and destination tensors. This signifi¬ 
cantly accelerates the brute-force implementation that requires 0(n p ) complexity forpth-order ten¬ 
sor contraction. In addition, in many applications, the input tensor is not directly available and needs 
to be computed from samples, such as the case of empirical moment tensors for spectral learning 
of latent variable models. In such cases, our method results in huge savings by combining implicit 
tensor contraction computation with efficient tensor sketch construction. 

Novel colliding hashes for symmetric tensors: When the input tensor is symmetric, which is the 
case for empirical moment tensors that arise in spectral learning applications, we propose a novel 
colliding hash design by replacing the Boolean ring with the complex ring C to handle multiplicities. 
As a result, it makes the sketch building process much faster and avoids repetitive FFT operations. 
Though the computational complexity remains the same, the proposed colliding hash design results 
in significant speed-up in practice by reducing the actual number of computations. 

Theoretical and empirical guarantees: We show that the quality of the tensor sketch does not 
depend on sparseness, uniform entry distribution, or any other properties of the input tensor. On the 
other hand, previous works assume specific settings such as sparse tensors l;24l f§1 H6l . or tensors 
having entries with similar magnitude ED- Such assumptions are unrealistic, and in practice, we 
may have both dense and spiky tensors, for example, unordered word trigrams in natural language 
processing. We prove that our proposed randomized method for tensor decomposition does not lead 
to any significant degradation of accuracy. 

Experiments on synthetic and real-world datasets show highly competitive results. We demonstrate 
a lOx to lOOx speed-up over exact methods for decomposing dense, high-dimensional tensors. For 
topic modeling, we show a significant reduction in computational time over existing spectral LDA 
implementations with small performance loss. In addition, our proposed algorithm outperforms 
collapsed Gibbs sampling when running time is constrained. We also show that if a Gibbs sampler is 
initialized with our output topics, it converges within several iterations and outperforms a randomly 
initialized Gibbs sampler run for much more iterations. Since our proposed method is efficient and 
avoids local optima, it can be used to accelerate the slow burn-in phase in Gibbs sampling. 

Related Works: There have been many works on deploying efficient tensor decomposition meth¬ 
ods El Ell EllIS] El 12IID- Most of these works except El [2 implement the alternating least 
squares (ALS) algorithm 1320. However, this is extremely expensive since the ALS method is 
run in the input space, which requires 0(n 3 ) operations to execute one least squares step on an 
n-dimensional (dense) tensor. Thus, they are only suited for extremely sparse tensors. 

An alternative method is to first reduce the dimension of the input tensor through procedures such as 
whitening to 0(k ) dimension, where k is the tensor rank, and then carry out ALS in the dimension- 
reduced space on k x k x k tensor ED. This results in significant reduction of computational 
complexity when the rank is small ( k <C n). Nonetheless, in practice, such complexity is still 
prohibitively high as k could be several thousands in many settings. To make matters even worse, 
when the tensor corresponds to empirical moments computed from samples, such as in spectral 
learning of latent variable models, it is actually much slower to construct the reduced dimension 
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Table 1: Summary of notations. See also AppendixjF] 


Variables 

Operator 

Meaning 

Variables 

Operator 

Meaning 

a, b 6 C n 

aobeC n 

Element-wise product 

a E C n 

£ j^nxnxn 

a © a © a 

a.b E C n 

a* b E C n 

Convolution 

A, B 6 C nxm 

A© B 6 C n2xm 

Khatri-Rao product 

a,beC n 

a®bE C nxn 

Tensor product 

T G (J^ nXnXn 

T (1) 6 C nxn2 

Mode expansion 


k x k x k tensor from training data than to decompose it, since the number of training samples is 
typically very large. Another alternative is to carry out online tensor decomposition, as opposed to 
batch operations in the above works. Such methods are extremely fast m , but can suffer from high 
variance. The sketching ideas developed in this paper will improve our ability to handle larger sizes 
of mini-batches and therefore result in reduced variance in online tensor methods. 

Another alternative method is to consider a randomized sampling of the input tensor in each iteration 
of tensor decomposition Em However, such methods can be expensive due to I/O calls and 
are sensitive to the sampling distribution. In particular, E3 employs uniform sampling, which is 
incapable of handling tensors with spiky elements. Though non-uniform sampling is adopted in 0, 
it requires an additional pass over the training data to compute the sampling distribution. In contrast, 
our sketch based method takes only one pass of the data. 


2 Preliminaries 


Tensor, tensor product and tensor decomposition A 3rd order tensor[]T of dimension n has n 3 
entries. Each entry can be represented as Tyj. for i,j, k £ {1, • • ■ , n}. For an n x n x n tensor T 
and a vector u £ R”, we define two forms of tensor products (contractions) as follows: 


n 

rz, u) — ^ ^ T.j j kUiUjUk, T(I, tr, rx) — 


''y ' T 1 .y. / X/ ; 11 . 5 ^ ^ T?y.y/\ / t/y 'll }■ 

j-k—\ j,k =1 


Note that T(tt, u. u) £ R and T(I, u. u) £ R". For two complex tensors A, B of the same order 
and dimension, its inner product is defined as (A, B) := AjB;, where l ranges over all tuples 
that index the tensors. The Frobenius norm of a tensor is simply ||A|jp = y 7 (A, A). 

The rank-k CP decomposition of a 3rd-order //-dimensional tensor T £ R" x " x '» in¬ 
volves scalars {A,:}^ =1 and n-dimensional vectors {a/, b./, Ci}^ =1 such that the residual ||T — 

i ^i a i <8> bi qIIf is minimized. Here R = a ® 6 c is a 3rd order tensor defined as 
Ryyfe = o,b ; C/ ; . Additional notations are defined in Tableland Appendix|F] 


Robust tensor power method The method was proposed in (T) and was shown to provably suc¬ 
ceed if the input tensor is a noisy perturbation of the sum of k rank-1 tensors whose base vectors 
are orthogonal. Fix an input tensor T £ K nxnx ", The basic idea is to randomly generate L initial 
vectors and perform T power update steps: ii = T(I, u 1 u) /||T(I. u, u)\\ 2 . The vector that results 
in the largest eigenvalue T(it, it, u) is then kept and subsequent eigenvectors can be obtained via 
deflation. If implemented naively, the algorithm takes 0(kn 3 LT) time to runJ^J requiring 0(n 3 ) 
storage. In addition, in certain cases when a second-order moment matrix is available, the tensor 
power method can be carried out on a k X k x k whitened tensor 12 , thus improving the time com¬ 
plexity by avoiding dependence on the ambient dimension n. Apart from the tensor power method, 
other algorithms such as Alternating Feast Squares (AFS, Ifl2l l5ll and Stochastic Gradient Descent 
(SGD, lfl4l ) have also been applied to tensor CP decomposition. 


Tensor sketch Tensor sketch was proposed in ||23l as a generalization of count sketch 0. For 
a tensor T of dimension ni x ••• x n p , random hash functions ,h p : [n] —> [6] with 

Pr hj [hj ( i ) = t\ = 1/b for every i £ [ n\,j £ [p],t £ [6] and binary Rademacher variables 
Cl; ‘ ‘ ‘ > ; [ n ] ~l► {±1}, the sketch St : [6] —> R of tensor T is defined as 

st(/) = ^2 CiO't) • • (1) 

H(i i}*" t 

1 Though we mainly focus on 3rd order tensors in this work, extension to higher order tensors is easy. 

2 L is usually set to be a linear function of k and T is logarithmic in n; see Theorem 5.1 in m 
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where H(ii, ■ ■ ■ . i p ) = (hi{i\) + ■ ■ • + h p (i p )) mod b. The corresponding recovery rule is 
T n> ... • • • £ p (i p )sT(H(ii, ■ ■ ■ , i p )). For accurate recovery, H needs to be 2-wise in¬ 

dependent, which is achieved by independently selecting hi , • • • ,h p from a 2-wise independent 
hash family l26l . Finally, the estimation can be made more robust by the standard approach of 
taking B independent sketches of the same tensor and then report the median of the B estimates CO- 

3 Fast tensor decomposition via sketching 

In this section we first introduce an efficient procedure for computing sketches of factored or empir¬ 
ical moment tensors, which appear in a wide variety of applications such as parameter estimation of 
latent variable models. We then show how to run tensor power method directly on the sketch with 
reduced computational complexity. In addition, when an input tensor is symmetric (i.e., Tijk the 
same for all permutations of i,j, k) we propose a novel “colliding hash” design, which speeds up 
the sketch building process. Due to space limits we only consider the robust tensor power method 
in the main text. Methods and experiments for sketching based ALS are presented in Appendix |C| 

To avoid confusions, we emphasize that n is used to denote the dimension of the tensor to be decom¬ 
posed, which is not necessarily the same as the dimension of the original data tensor. Indeed, once 
whitening is applied n could be as small as the intrinsic dimension k of the original data tensor. 

3.1 Efficient sketching of empirical moment tensors 

Sketching a 3rd-order dense //-dimensional tensor via Eq. (jTJ takes 0(n 3 ) operations, which in 
general cannot be improved because the input size is (2(n 3 ). However, in practice data tensors are 
usually structured. One notable example is empirical moment tensors, which arises naturally in 
parameter estimation problems of latent variable models. More specifically, an empirical moment 
tensor can be expressed as T = E[a?® 3 ] = A xf 3 , where N is the total number of training 
data points and x t is the /th data point. In this section we show that computing sketches of such 
tensors can be made significantly more efficient than the brute-force implementations via Eq. 0- 
The main idea is to sketch low-rank components ol T efficiently via FFT, a trick inspired by previous 
efforts on sketching based matrix multiplication and kernel learning Il22lf23l . 

We consider the more generalized case when an input tensor T can be written as a weighted sum 
of known rank-1 components: T = a ,u, 0 Vi 0 w,, where a, are scalars and Ui,Vi, Wi are 

known n-dimensional vectors. The key observation is that the sketch of each rank-1 component 
Tj = u, 0 v, 0 tv, can be efficiently computed by FFT. In particular, ,s T can be computed as 

= s 1}Ui * s 2 , Vi * s 3>Wi = •7 r-1 (-7 r (si,'u l ) ° F{s 2 , Vi ) ° F{S3, Wi )), (2) 

where * denotes convolution and o stands for element-wise vector product. s lM (i) = 
S/ il (i)=t * s the count sketch of u and s 2 }V ,S 3 ,w are defined similarly. T and J 7-1 de¬ 

note the Fast Fourier Transform (FFT) and its inverse operator. By applying FFT, we reduce the 
convolution computation into element-wise product evaluation in the Fourier space. Therefore, St 
can be computed using 0(n + b log b) operations, where the 0(b log b) term arises from FFT evalua¬ 
tions. Finally, because the sketching operator is linear (i.e., s( JA a^T,) = JA ajs(Tj)), Sx can be 
computed in 0(N(n + 6 log b)), which is much cheaper than brute-force that takes 0(Nn 3 ) time. 

3.2 Fast robust tensor power method 

We are now ready to present the fast robust tensor power method, the main algorithm of this paper. 
The computational bottleneck of the original robust tensor power method is the computation of two 
tensor products: T(I, u, u) and T(u. u. u). A naive implementation requires 0(n 3 ) operations. 
In this section, we show how to speed up computation of these products. We show that given the 
sketch of an input tensor T, one can approximately compute both T(I,it, u) and T(u,u,u) in 
0(6 log b + n) steps, where b is the hash length. 

Before going into details, we explain the key idea behind our fast tensorproduct computation. For 
any two tensors A, B, its inner product (A, B) can be approximated by FT 

(A, B) m (s A , s B ). (3) 


•) denotes the real part of a complex number, med(-) denotes the median. 


4 All approximations will be theoretically justified in SectionEland Appendix 
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Algorithm 1 Fast robust tensor power method 


1: Input: noisy symmetric tensor T = T + E G l nxnxn ; target rank number of initializations 
L, number of iterations T, hash length b, number of independent sketches B. 

2: Initialization: h^ m \ for j £ {l,2,3}andmS [B]; compute sketches £ C b . 

3: for r = 1 to L do 

4: Draw Uq uniformly at random from unit sphere. 

5: for t = 1 to T do 

6: For each m € [B],j £ {2, 3} compute the sketch of u f T _}\ using li " 1 ' 1 via Eq. (ITj). 

7: Compute v( m ' ) ss T(I, u^\, u^\) as follows: first evaluate s = •7 r ~ 1 (.F(sjj) o 


-^(4 m J)°^(45))- Set b (m) ]t as <- £i(i)[s (m) ]M») for ever Y * £ W- 

8: Set Vi t— med(5R(n- 1 ^), • • • , Update: u = u/||tj||. 


9: Selection Compute Ass T(u^ \ u^\u^') using s'£ l> for r £ [L] and m £ [B\. Evaluate 


„( m ) 


X T = med(Ai 1 \ ■ • • , A^) and r* = argmax T A T . Set A = A T * and u = 1 ■ 

10: Deflation For each m £ [B] compute sketch 5^ for the rank-1 tensor AT = Am 03 . 

11: Output: the eigenvalue/eigenvector pair (A, u) and sketches of the deflated tensor T — AT. 


Table 2: Computational complexity of sketched and plain tensor power method, n is the tensor dimension; k is 
the intrinsic tensor rank; b is the sketch length. Per-sketch time complexity is shown. 



Plain 

Sketch 

Plain+Whitening 

Sketch+Whitening 

preprocessing: general tensors 

- 

Ofa 3 ) 

0(kn 3 ) 

Opr 5 ) 

preprocessing: factored tensors 
with N components 

0(Nn 3 ) 

0{N{n + b log 6)) 

0(N(nk + k 3 )) 

0(N(nk + blog 6)) 

per tensor contraction time 

0(n 3 ) 

0(n + Mogb) 

Q(k 3 ) 

0(k + blogb) 


Eq. Oil immediately results in a fast approximation procedure of T(u, it, u) because T(it, it, it) = 
(T, xj where X = itCmCgm is a rank one tensor, whose sketch can be built in 0(n+b log b) time by 
Eq. (pi. Consequently, the product can be approximately computed using ()(n + b log b) operations 
if the tensor sketch of T is available. For tensor product of the form T(I, it, it). The ith coordinate 
in the result can be expressed as (T, Y,) where Y, = e.i (g) u ® it; = (0, • • • , 0, 1, 0, ■ • • , 0) is 
the ith indicator vector. We can then apply Eq. (|3]l to approximately compute (T. Y,) efficiently. 
However, this method is not completely satisfactory because it requires sketching n rank-1 tensors 
(Y i through Y„), which results in ()(n) FFT evaluations by Eq. (pb. Below we present a proposition 
that allows us to use only 0(1) FFTs to approximate T(I, it, it). 

Proposition 1. (s T ,s lje . * s 2 , u * s 3 , u ) = (J‘ _1 (J"(s T ) ° F(s 2 , u ) ° F(s^ u )), s 1 ^ i ). 

Proposition [T]is proved in Appendix |E.1| The main idea is to “shift” all terms not depending on i to 
the left side of the inner product and eliminate the inverse FFT operation on the right side so that s e . 
contains only one nonzero entry. As a result, we can compute J r_1 (J r (s t) o J-(s 2 . u ) 0 bF(s 3 U )) 
once and read off each entry of T(I, it, it) in constant time. In addition, the technique can be 
further extended to symmetric tensor sketches, with details deferred to Appendix [B] due to space 
limits. When operating on an n-dimensional tensor. The algorithm requires 0(kLT(n + Bh log b)) 
running time (excluding the time for building ,s T ) and 0(Bb) memory, which significantly improves 
the 0{kn 3 LT) time and 0(n 3 ) space complexity over the brute force tensor power method. Here 
L. T are algorithm parameters for robust tensor power method. Previous analysis shows that T = 
0(log k) and L = poly(fc), where poly(-) is some low order polynomial function. |l] 

Finally, Table [2] summarizes computational complexity of sketched and plain tensor power method. 

3.3 Colliding hash and symmetric tensor sketch 

For symmetric input tensors, it is possible to design a new style of tensor sketch that can be built 
more efficiently. The idea is to design hash functions that deliberately collide symmetric entries, i.e., 
(i,j, k ), ( j , i, k), etc. Consequently, we only need to consider entries T l3 k with i <j<k when 
building tensor sketches. An intuitive idea is to use the same hash function and Rademacher random 
variable for each order, that is, h\{i) = h- 2 {i) = h^{i) =: h{i) and (*) = ^(*) = ^3(®) =: £(*)• 
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In this way, all permutations of (z, j, k ) will collide with each other. However, such a design has an 
issue with repeated entries because £(z) can only take ±1 values. Consider (z, i, k) and (j, j, k ) as 
an example: £(z) 2 £(fc) = £(j) 2 £(k) with probability 1 even if i / j. On the other hand, we need 
E[£(a)£(6)] = 0 for any pair of distinct 3-tuples a and b. 

To address the above-mentioned issue, we extend the Rademacher random variables to the complex 
domain and consider all roots of z m = 1, that is, LI = where uij = e 1 ^ . Suppose <j(i) is 

a Rademacher random variable with Pr[cr(z) = u;*] = 1/m. By elementary algebra, E[er(z) p ] = 0 
whenever m is relative prime to p or rn can be divided by p. Therefore, by setting m = 4 we avoid 
collisions of repeated entries in a 3rd order tensor. More specifically. The symmetric tensor sketch 
of a symmetric tensor T £ I" X "X" can be defined as 

ST(t) ■■= 'Yl T hTfc°'(*) fT (i) f7 ( fc )> ( 4 ) 

H(i,j,k)=t 

where H(i, j, k) = ( h(i ) + h(j) + h(k)) mod b. To recover an entry, we use 

T ij,k = l/« • i) ■ a(j) ■ a(k) ■ s T {H(i,j,k)), (5) 

where n = 1 if z = j = k\ n = 3 if z = j or j = k or z = k\ n = 6 otherwise. For higher order 
tensors, the coefficients can be computed via the Young tableaux which characterizes symmetries 
under the permutation group. Compared to asymmetric tensor sketches, the hash function h needs 
to satisfy stronger independence conditions because we are using the same hash function for each 
order. In our case, h needs to be 6-wise independent to make H 2-wise independent. The fact is due 
to the following proposition, which is proved in Appendix |E.l| 

Proposition 2. Fix p and q. For h : [zz] —> [6] define symmetric mapping H : [?z] p —> [6] as 
H(ii, ■ ■ ■ ,i p ) = h(i±) + ■ ■ ■ + h(i p ). Ifh is ( pq)-wise independent then fT is q-wise independent. 

The symmetric tensor sketch described above can significantly speed up sketch building processes. 
For a general tensor with M nonzero entries, to build .s T one only needs to consider roughly M /6 
entries (those T ijk 0 with i < j < k). For a rank-1 tensor it® 3 , only one FFT is needed to build 
T(s)\ in contrast, to compute Eq. (j2j) one needs at least 3 FFT evaluations. 

Finally, in Appendix[B]we give details on how to seamlessly combine symmetric hashing and tech¬ 
niques in previous sections to efficiently construct and decompose a tensor. 

4 Error analysis 

In this section we provide theoretical analysis on approximation error of both tensor sketch and the 
fast sketched robust tensor power method. We mainly focus on symmetric tensor sketches, while 
extension to asymmetric settings is trivial. Due to space limits, all proofs are placed in the appendix. 

4.1 Tensor sketch concentration bounds 

Theorem IT] bounds the approximation error of symmetr ic te nsor sketches when computing 
T(it, u, tz) and T(I, u. u). Its proof is deferred to Appendix |E.2| 

Theorem 1. Fix a symmetric real tensor T £ R"xnxn and a real vector u £ R” with ||«||2 = 1. 
Suppose £i ,t(u) £ R and £2 ,t{u) £ R n are estimation errors of T(u,u,u) and T(I,it,iz) 
using B independent symmetric tensor sketches; that is, £i,t(m) = T(ix,tz,it) — T (u,u,u) and 
£2 ,t(u) = T(I, zz, u) — T(I, u, u). IfB = fl(log(l/i5)) then with probability >1 — 8 the following 
error bounds hold: 

|e liT («)| =0 (||T|| f /V 6); | [e 2 ,!■(«)]* | = 0(\\T\\ F /Vb), Vz £ {1, • • • , n}. (6) 

In addition, for any fixed w £ ]R", ||zn ||2 = 1 with probability > 1 — 5 we have 

(w,e 2 , T (u)) 2 = 0(\\T\\ 2 F /b). (7) 

4.2 Analysis of the fast tensor power method 

We present a theorem analyzing robust tensor power method with tensor sketch approximations. A 
more detailed theorem statement along with its proof can be found in Appendix |E.3| 

Theorem 2. Suppose T = T + E £ R nxnxn where T = ^i v T 3 with an orthonor¬ 

mal basis {vi}i =1 , Ai > • • • > Afc > 0 and ||E|| = e. Let {(A,,z)i)}/ =1 be the eigen- 
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Table 3: Squared residual norm on top 10 recovered eigenvectors of lOOOd tensors and running time (excluding 
I/O and sketch building time) for plain (exact) and sketched robust tensor power methods. Two vectors are 
considered mismatch (wrong) if ||i> — -£>||2 > 0.1. A extended version is shown as Table|5]in Appendix |X| 



log 2 (b): 


Residual norm 


No. of wrong vectors 


Running time (min.) 


12 

13 

14 

15 

16 

12 

13 

14 

15 

16 

12 

13 

14 

15 

16 

,—1 

B = 20 

.40 

.19 

.10 

.09 

.08 

8 

6 

3 

0 

0 

.85 

1.6 

3.5 

7.4 

16.6 


B = 30 

.26 

.10 

.09 

.08 

.07 

7 

5 

2 

0 

0 

1.3 

2.4 

5.3 

11.3 

24.6 

II 

B = 40 

.17 

.10 

.08 

.08 

.07 

7 

4 

0 

0 

0 

1.8 

3.3 

7.3 

15.2 

33.0 

b 

Exact 

.07 





0 





293.5 





Table 4: Negative log-likelihood and running time (min) on the large Wikipedia dataset for 200 and 300 topics. 


k 

like. 

time 

log 2 & 

iters 

k 

like. 

time 

log, 6 

iters 


Spectral 

7.49 

34 

12 

- 


7.39 

56 

13 

- 

0 

<N 

Gibbs 

6.85 

561 

- 

30 

0 

CO 

6.38 

818 

- 

30 

Hybrid 

6.77 

144 

12 

5 

6.31 

352 

13 

10 


value/eigenvector pairs obtained by Algorithm^ 7] Suppose e = 0(l/(Ain)), T = 0(log(n/5) + 
log(l/e) maxi Aj/(A,; — A;_i)) and L grows linearly with k. Assume the randomness of the tensor 
sketch is independent among tensor product evaluations. If B = Sl(log(ro/<5)) and b satisfies 


b = n( 




max 


£- 2 llT||2 S- 4 n 2 \\T\\l 


( 8 ) 


A(A) 2 ’ r{\y-\\ 

where A(A) = min. ( (A,; — A;_i) and r( A) = / Xj), then with probability >1 — 5 there 

exists a permutation n over [/c] such that 

II^TrCi) - Vih < e, |A ff(i ) — A*| < A*e/2, Vi G {1, ■■■,&} (9) 

and ||T - ^i*? 3 || < cefor some constant c. 


Theorem[l]shows that the sketch length b can be set as o(n 3 ) to provably approximately decompose 
a 3rd-order tensor with dimension n. Theorem[l]together with time complexity comparison in Table 
[2]shows that the sketching based fast tensor decomposition algorithm has better computational com¬ 
plexity over brute-force implementation. One potential drawback of our analysis is the assumption 
that sketches are independently built for each tensor product (contraction) evaluation. This is an ar¬ 
tifact of our analysis and we conjecture that it can be removed by incorporating recent development 
of differentially private adaptive query framework 0. 


5 Experiments 

We demonstrate the effectiveness and efficiency of our proposed sketch based tensor power method 
on both synthetic tensors and real-world topic modeling problems. Experimental results involving 
the fast ALS method are presented in Appendix |C.3| All methods are implemented in C++ and 
tested on a single machine with 8 Intel X5550@2.67Ghz CPUs and 32GB memory. For synthetic 
tensor decomposition we use only a single thread; for fast spectral LDA 8 to 16 threads are used. 


5.1 Synthetic tensors 

In Table [5] we compare our proposed algorithms with exact decomposition methods on synthetic 
tensors. Let n = 1000 be the dimension of the input tensor. We first generate a random orthonormal 
basis and then set the input tensor T as T = normalizeA i v f 3 ) + E, where the 

eigenvalues \ satisfy Aj = 1/i. The normalization step makes ||T||p = 1 before imposing noise. 
The Gaussian noise matrix E is symmetric with E ijk ~ 7V"(0, cr/n 13 ) for i< j < k and noise-to- 
signal level a. Due to time constraints, we only compare the recovery error and running time on the 
top 10 recovered eigenvectors of the full-rank input tensor T. Both L and T are set to 30. Table[3] 
shows that our proposed algorithms achieve reasonable approximation error within a few minutes, 
which is much faster then exact methods. A complete version (Table[5} is deferred to Appendix |A| 


5.2 Topic modeling 

We implement a fast spectral inference algorithm for Latent Dirichlet Allocation (LDA 0) by com¬ 
bining tensor sketching with existing whitening technique for dimensionality reduction. Implemen- 
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Figure 1: Left: negative log-likelihood for fast and exact tensor power method on Wikipedia dataset. Right: 
negative log-likelihood for collapsed Gibbs sampling, fast LDA and Gibbs sampling using Fast LDA as initial¬ 
ization. 

tation details are provided in Appendix [D] We compare our proposed fast spectral LDA algorithm 
with baseline spectral methods and collapsed Gibbs sampling (using GibbsLDA++ l25l implemen¬ 
tation) on two real-world datasets: Wikipedia and Enron. Dataset details are presented in [A] Only 
the most frequent V words are kept and the vocabulary size V is set to 10000. For the robust tensor 
power method the parameters are set to L = 50 and T = 30. For ALS we iterate until convergence, 
or a maximum number of 1000 iterations is reached. ao is set to 1.0 and B is set to 30. 

Obtained topic models <1> G R Vx K are evaluated on a held-out dataset consisting of 1000 documents 
randomly picked out from training datasets. For each testing document d, we fit a topic mixing vector 
•kfi £ Hl K by solving the following optimization problem: kd = argmin|| 7r || 1 _ 17r>0 ||ti; ( ; — <I?7r||2, 
where wj is the empirical word distribution of document d. The per-document log-likelihood is 
then defined as C d = ^ In p(w d i), where p(w di ) = J2k=i '*k®w di ,k- Finally, the average 
Cd over all testing documents is reported. 

Figure [T] left shows the held-out negative log-likelihood for fast spectral LDA under different hash 
lengths b. We can see that as b increases, the performance approaches the exact tensor power method 
because sketching approximation becomes more accurate. On the other hand. Table [6] shows that 
fast spectral LDA runs much faster than exact tensor decomposition methods while achieving com¬ 
parable performance on both datasets. 

Figure [T] right compares the convergence of collapsed Gibbs sampling with different number of 
iterations and fast spectral LDA with different hash lengths on Wikipedia dataset. For collapsed 
Gibbs sampling, we set a = 50 /K and 3 = 0.1 following ifTTl . As shown in the figure, fast spectral 
LDA achieves comparable held-out likelihood while running faster than collapsed Gibbs sampling. 
We further take the dictionary $ output by fast spectral LDA and use it as initializations for collapsed 
Gibbs sampling (the word topic assignments z are obtained by 5-iteration Gibbs sampling, with the 
dictionary $ fixed). The resulting Gibbs sampler converges much faster: with only 3 iterations 
it already performs much better than a randomly initialized Gibbs sampler run for 100 iterations, 
which takes lOx more running time. 

We also report performance of fast spectral LDA and collapsed Gibbs sampling on a larger dataset 
in Table [4] The dataset was built by crawling 1,085,768 random Wikipedia pages and a held-out 
evaluation set was built by randomly picking out 1000 documents from the dataset. Number of 
topics k is set to 200 or 300, and after getting topic dictionary from fast spectral LDA we use 2- 
iteration Gibbs sampling to obtain word topic assignments z. Table [4] shows that the hybrid method 
(i.e., collapsed Gibbs sampling initialized by spectral LDA) achieves the best likelihood performance 
in a much shorter time, compared to a randomly initialized Gibbs sampler. 

6 Conclusion 

In this work we proposed a sketching based approach to efficiently compute tensor CP decom¬ 
position with provable guarantees. We apply our proposed algorithm on learning latent topics of 
unlabeled document collections and achieve significant speed-up compared to vanilla spectral and 
collapsed Gibbs sampling methods. Some interesting future directions include further improving 
the sample complexity analysis and applying the framework to a broader class of graphical models. 

Acknowledgement: Anima Anandkumar is supported in part by the Microsoft Faculty Fellowship 
and the Sloan Foundation. Alex Smola is supported in part by a Google Faculty Research Grant. 
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Appendix A Supplementary experimental results 


The Wikipedia dataset is built by crawling all documents in all subcategories within 3 layers below 
the science category. The Enron dataset is from the Enron email corpus El. After usual cleaning 
steps, the Wikipedia dataset has 114, 274 documents with an average 512 words per document; the 
Enron dataset has 186, 501 emails with average 91 words per email. 


Table 5: Squared residual norm on top 10 recovered eigenvectors of lOOOd tensors and running time (excluding 
I/O and sketch building time) for plan (exact) and sketched robust tensor power methods. Two vectors are 
considered mismatched (wrong) if ||« — m||| > 0 . 1 . 



log 2 ( 6 ): 


Residual norm 


No. of wrong vectors 


Running time (min.) 

12 

13 

14 

15 

16 

12 

13 

14 

15 

16 

12 

13 

14 

15 

16 


B = 20 

.40 

.19 

.10 

.09 

.08 

8 

6 

3 

0 

0 

.85 

1.6 

3.5 

7.4 

16.6 

O 

B = 30 

.26 

.10 

.09 

.08 

.07 

7 

5 

2 

0 

0 

1.3 

2.4 

5.3 

11.3 

24.6 

II 

B = 40 

.17 

.10 

.08 

.08 

.07 

7 

4 

0 

0 

0 

1.8 

3.3 

7.3 

15.2 

33.0 

b 

Exact 

.07 





0 





293.5 





B = 20 

.52 

3.1 

.21 

.18 

.17 

8 

7 

4 

0 

0 

.84 

1.6 

3.5 

7.5 

16.8 


B = 30 

4.0 

.24 

.19 

.17 

.16 

7 

5 

3 

0 

0 

1.3 

2.5 

5.4 

11.6 

26.2 


B = 40 

.30 

.22 

.18 

.17 

.16 

7 

4 

0 

0 

0 

1.8 

3.3 

7.3 

15.5 

33.5 


Exact 

.16 





0 





271.8 





Table 6 : Selected negative log-likelihood and running time (min) for fast and exact spectral methods on 
Wikipedia (top) and Enron (bottom) datasets. 




k 

= 50 


k 

= 100 


k 

= 200 


Fast RB 

RB 

ALS 

Fast RB 

RB 

ALS 

Fast RB 

RB 

ALS 


like. 

8.01 

7.94 

8.16 

7.90 

7.81 

7.93 

7.86 

7.77 

7.89 


time 

2.2 

97.7 

2.4 

6.8 

135 

29.3 

57.3 

423 

677 


log 2 & 

10 

- 

- 

12 

- 

- 

14 

- 

- 

C 

like. 

8.31 

8.28 

8.22 

8.18 

8.09 

8.30 

8.26 

8.18 

8.27 

H 

2 

time 

2.4 

45.8 

5.2 

3.7 

93.9 

40.6 

6.4 

219 

660 

W 

log 2 6 

11 

- 

- 

11 

- 

- 

11 

- 

- 


Appendix B Fast tensor power method via symmetric sketching 


In this section we show how to do fast tensor power method using symmetric tensor sketches. More 
specifically, we explain how to approximately compute T(it, u. u) and T(I, u. u) when colliding 
hashes are used. 


For symmetric tensors A and B, their inner product can be approximated by 


(A, B) ~ (Sa,5jj), 

where B is an “upper-triangular” tensor defined as 


( 10 ) 


/ H i.Jj.k - if * < J A k\ 

1 0, otherwise. 


( 11 ) 


Note that in Eq. ( [TO i onl y the matrix B is “truncated”. We show this gives consistent estimates of 
(A, B) in Appendix E.2| 


Recall that T(it, u, u) = (T, X) where X = it ® u Q u. The symmetric tensor sketch 5^. can be 
computed as 

= rtS2.UOU * &U T (12) 

6 2 d 


where s 2 , uou (f) = E2 h(i)=t a ( i ) 2 ' u i and S3,1x0 uou (t) = Y!,3h( i i)=t a ( i ) 3u i- As a result, 

T(m, it, it) m ^(J'(sT),^ r (S u )oJ r (S u )oJ r (5 u ))+I(J'(5 T ) j jr(i 2 ou )oJ'(S u ))+^( 5 t , 
0 2 6 


S3,uouou 


)• 


(13) 
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Algorithm 2 Fast ALS method 
1: Input: T £ I"x nx ", target rank k, T, B, b. 

2 : Initialize: B independent index hash functions h - 1 ' 1 , ■ ■ ■ . h 1 ,S) and ... . a (,i h random 
matrices A, B, C £ R nxk ; {A,}.f =1 . 

3: For m = 1, • • • ,B compute £ C b . 

4: for t = 1 to T do 

5: Compute count sketches s&., s c . for i = 1, • • • , k. For each i = 1, • • • , k; m = 1, • • • , b 

compute n- m) ss T(I, b t , q). 

6: Vij <- med(5R(n^ ) ), • • • , ) )). 

7: Set A = {v}ij and A,; = ||di||; afterwards, normalize each column of A. 

8 : Update B and C similarly. 

9: Output: eigenvalues {Aj}^ =1 ; solutions A,B, C. 

For T(I, u, u) recall that [T(I, u , u)]i = (T, Y^) where Y,; = 0 « ® u. We first symmetrize it 

by defining Z= e* © u <g> u + u <g> e* © u + u © u © e*. 0The sketch of Zj can be subsequently 
computed as 

^2, — 2^ U * ©/ * ^ei T 2,uou * &ei 4“ ^2,eiOu * A' ^3,eiOuou ■ (14) 

Consequently, 

T(I, u, u) « (j(s T ) ° ,F(§ U )) , S 2 , ei o u ^ + ^ (V _1 (j(ST)oA'(«u)oJ(^)j ,s ei ) 

,-uo-u)^ i &ei ^ ^3,eiOuou) ■ (15) 

Note that all of s ei , S 2 , ei ou and S 3 , eiOUOU have exactly one nonzero entries. So we can pre-compute 
all terms on the left sides of inner products in Eq. ( |~i~5j ) and then read off the values for each entry in 
T(I, u, u). 


Appendix C Fast ALS: method and simulation result 


In this section we describe how to use tensor sketching to accelerate the Alternating Least Squares 
(ALS) method for tensor CP decomposition. We also provide experimental results on synthetic data 
and compare our fast ALS implementation with the Matlab tensor toolbox [32, 33], which is widely 
considered to be the state-of-the-art for tensor decomposition. 


C.l Alternating Least Squares 

Alternating Least Squares (ALS) is a popular method for tensor CP decompositions fl9l . The 
algorithm maintains A £ Si A: , A, B, C £ R" xfc and iteratively perform the following update steps: 

A = T (1) (C©B)(C T CoB T B) t . (16) 

B = T (1) (A © C)(A t A o CTC)*; 

C = T (1) (B 0 A)(B t B o A t A^. 

After each update, A r is set to ||a r ||2 (or || 6 r || 2 , ||c r || 2 ) for r = 1, • • • , k and the matrix A (or B, C) 
is normalized so that each column has unit norm. The final low-rank approximation is obtained by 

Eti ® i>i © at. 

There is no guarantee that ALS converges or gives a good tensor decomposition. Nevertheless, it 
works reasonably well in most applications m. In general ALS requires 0(T(n 3 k + k 3 )) compu¬ 
tations and 0(n 3 ) storage, where T is the number of iterations. 
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Table 7: Squared residual norm on top 10 recovered eigenvectors of lOOOd tensors and running time (excluding 
I/O and sketch building time) for plain (exact) and sketched ALS algorithms. Two vectors are considered 
mismatched (wrong) if ||i? — v ||2 > 0.1. 





Residual norm 


No. of wrong vectors 

Running time (min.) 




12 

13 

14 

15 

16 

12 

13 

14 

15 

16 

12 

13 

14 

15 

16 


B = 20 

.71 

.41 

.25 

.17 

.12 

10 

9 

7 

6 

4 

.11 

.22 

.49 

1.1 

2.4 


B = 30 

.50 

.34 

.21 

.14 

.11 

9 

8 

7 

5 

3 

.17 

.33 

.75 

1.6 

3.5 

II 

B = 40 

.46 

.28 

.17 

.10 

.07 

9 

8 

6 

5 

1 

.23 

.45 

1.0 

2.2 

4.7 

b 

Exact/ 

.07 





1 





22.8 






B — 20 

.88 

.50 

.35 

.28 

.23 

10 

8 

7 

6 

6 

.13 

.32 

.78 

1.5 

3.2 


B = 30 

.78 

.44 

.30 

.24 

.21 

9 

8 

7 

5 

6 

.21 

.50 

1.1 

2.2 

4.7 


B = 40 

.56 

.38 

.28 

.19 

.16 

9 

8 

6 

4 

2 

.29 

.69 

1.5 

3.5 

6.3 


Exact/ 

.17 





2 





32.3 






/Calling cp als in Matlab tensor toolbox. It is run for exactly T = 30 iterations. 


C.2 Accelerated ALS via sketching 


Similar to robust tensor power method, the ALS algorithm can be significantly accelerated by using 
the idea of sketching as shown in this work. However, for ALS we cannot use colliding hashes 
because though the input tensor T is symmetric, its CP decomposition is not since we maintain 
three different solution matrices A, B and C. As a result, we roll back to asymmetric tensor sketches 
defined in Eq. ([T]). Recall that given A.B.C £ R nxk we want to compute 

A = T( 1} (C © B)(C t C o B T B) f . (17) 


When k is much smaller than the ambient tensor dimension n the computational bottleneck of Eq. 
(17 1 is T (1 )(C © B), which requires 0{n 3 k) operations. Below we show how to use sketching to 
speed up this computation. 


Let x £ R n be one row in Tjq) and consider (C 0 B ) 1 x. It can be shown that Q5) 

[(COB ) T x\.=bJXc u Vi = 1, ■ - - , As, (18) 

where X £ R nx ™ is the reshape of vector x. Subsequently, the product Tm(C 0 B) can be 
re-written as 

T (1) (C © B) = [T(I, ■ ■ ■ :T(I ,b k ,c k )]. (19) 


Using Proposition [T] we can compute each of T(I. b t . c,j in 0(n + b log b) iterations. Note that 
in general b, yZ a, but Proposition [T] still holds by replacing one of the two s u sketches. As a 
result, T (1) (C 0 B) can be computed in 0(k(n + b log b)) operations once St is computed. The 
pseudocode of fast ALS is listed in Algorithm [2] Its time complexity and space complexity are 
0(T(k(n + Bblogb) + k 3 )) (excluding the time for building «t) and 0(Bb), respectively. 


C.3 Simulation results 

We compare the performance of fast ALS with a brute-force implementation under various hash 
length settings on synthetic datasets in Table [7] Settings for generating the synthetic dataset is 
exactly the same as in Section |5TT| We use the cp_als routine in Matlab tensor toolbox as the 
reference brute-force implementation of ALS. For fair comparison, exactly T = 30 iterations are 
performed for both plain and accelerated ALS algorithms. Table [7] shows that when sketch length 
b is not too small, fast ALS achieves comparable accuracy with exact methods while being much 
faster in terms of running time. 


Appendix D Spectral LDA and fast spectral LDA 

Latent Dirichlet Allocation (LDA, 0) is a powerful tool in topic modeling. In this section we first 
review the LDA model and introduce the tensor decomposition method for learning LDA models, 
which was proposed in 0. We then provide full details of our proposed fast spectral LDA algorithm. 
Pseudocode for fast spectral LDA is listed in Algorithm[3] 

3 As long as A is symmetric, we have (A, Y;) = (A, Z;)/3. 
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Algorithm 3 Fast spectral LDA 
1 : Input: Unlabeled documents, V, K, ao, B, b. 

2 : Compute empirical moments Mi and M 2 defined in Eq. 

3: [U, S, V] <- truncatedSVD(M 2 , k)\ W ik <- 

4: Build B tensor sketches of M 3 (W, W, W). 

5: Find CP decomposition {A,}f =1 , A = B = C = {vi}^ =1 of M 3 (W, W, W) using either fast 
tensor power method or fast ALS method. 

6 : Output: estimates of prior parameters a, = and topic distributions fi t = 

D.l LDA and spectral LDA 


(20 


211. 


LDA models a collection of documents by a topic dictionary <t> £ R v x K and a Dirichlet prior 
a. £ &J'\ where V is the vocabulary size and k is the number of topics. Each column in $ is 
a probability distribution (i.e., non-negative and sum to one) representing the word distribution of 
a particular topic. For each document d, a topic mixing vector h,i £ R ,:: is first sampled from a 
Dirichlet distribution parameterized by a. Afterwards, words in document d i.i.d. sampled from a 
categorical distribution parameterized by 


A spectral method for LDA based on 3rd-order robust tensor decomposition was proposed in (T} 
to provably leam LDA model parameters from a polynomial number of training documents. Let 
x £ R' represent a single word; that is, for word w we have x w = 1 and x w / = 0 for all w' ^ w. 
Define first, second and third order moments Mi, M 2 and M 3 as follows: 

Mi = E[ai]; 


M 2 = E[aii <g x 2 ] - 


Cko 


a 0 + 1 


-Mi <g Mi 


a 0 


( 20 ) 
( 21 ) 

Xi <g *2]) 

(a 0 + l)(a 0 + 2)- 1 ™ (22) 

Here qq = ]R ; a k is assumed to be a known quantity. Using elementary algebra it can be shown 
that 


M 3 = E[cci <g at 2 <g * 3 ] --(E[ati (g x 2 (g M x ] + Efsi (g Mi (g x 2 ] + E[Mi 

a 0 + 2 


2 <4 


-Mi <g Mi (g Mi. 


M 2 = 


M 3 = 


a 0 (ao + l) 
2 


Y aiViVj ; 


OL 0 


7 --r^r Y ® Mi ® Mi- 

\ a o + l)( a o + 2) ^ 


(23) 

(24) 


To extract topic vectors (/r, }f =1 from M 2 and M 3 , a simultaneous diagonalization procedure is 
carried out. More specifically, the algorithm first finds a whitening matrix W £ R vx h with or¬ 
thonormal columns such that W T M 2 W = Ikxk- In practice, this step can be completed by 
performing a truncated SVD on M 2 , M 2 = U Kl and set W ik = XJ ik /\/H kk . Afterwards, 
tensor CP decomposition is performed on the whitened third order moment M 3 (W, W, W) j^to 
obtain a set of eigenvectors {vk} k= i- The topic vectors { H k can be subsequ ently obtained 
by multiplying {v k } k=1 with the pseudoinverse of W. Note that Eq. < | 20 | 21 | 22 [ > are defined in 
exact word moments. In practice we use empirical moments (e.g., word frequency vector and co¬ 
occurrence matrix) to approximate these exact moments. 


6 For a tensor T £ K vxvxv and a matrix W £ R Vxk , the product Q = T(W, W, W) £ R fexfexfc is 
defined as Qii.i2.i3 = Sj 1 ,j 2 ,j3=i Wj2>» 2 Wi 3 .' 3 - 
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D.2 Fast spectral LDA 


To further accelerate the spectral method mentioned in the previous section, it helps to first iden¬ 
tify computational bottlenecks of spectral LDA. In general, the computation of Mi, M 2 and the 
whitening step are not the computational bottleneck when V is not too large and each document is 
not too long. The bottleneck comes from the computation of (the sketch of) M 3 (W, W, W) and 
its tensor decomposition. By Eq. (22 1 , the computation of M 3 (W, W, W) reduces to comput¬ 
ing Mf 3 (W, W, W), E[ Xl <g) x 2 <g> Mi](W, W, W),[]and E[a:i <g> x 2 <g> ® 3 ](W, W, W). The 
first term Mf 3 (W, W, W) poses no particular challenge as it can be written as (W t Mi)® 3 . Its 
sketch can then be efficiently obtained by applying techniques in Section 3.1 In the remainder of 


this section we focus on efficient computation of the sketch of the other two terms mentioned above. 


We first show how to efficiently sketching E[a;i ® (g> tc 3 ] (W, W, W) given the whitening matrix 

W and D training documents. Let TE[aii ® *2 ® a: 3 ](W, W, W) denote the whitened k x k x k 
tensor to be sketched and write T = Yhd= 1 where T fJ r is the contribution of the r/th training 
document to T. By definition, T rf can be expressed as = Nd(W, W, W), where W is the 
V x k whitening matrix and is the V x V x V empirical moment tensor computed on the dth 
document. More specifically, for i,j,k£ {1, • • • , V } we have 

{ n d i{n d j - 1 )(n dk - 2), i = j = k; 
n di (n di - 1 )n dk , i = j,j ^ k; 

n di n dj (n dj - 1) j = k,i^j; 

n di (n di -l)n dj , i = 

n di n dj n dk , otherwise. 

Here m d is the length (i.e., number of words) of document d and n d £ B.' / is the corresponding word 
count vector. Previous straightforward implementation require at least 0(k 3 + m d k 2 ) operations per 
document to build the tensor T and 0(k 4 LT) to decompose it [|30ll29l . which is prohibitively slow 
for real-world applications. In section [3] we discussed how to decompose a tensor efficiently once 
we have its sketch. We now show how to build the sketch of T efficiently from document word 
counts {ra,d}£=r 

By definition, can be decomposed as 

v v 

T d = p® 3 - ^2 n i( w i ® Wi <g> p+Wi ®p® Wi+p (8) Wi ® Wi) + ^2 Zrnwf 3 , (25) 

i— 1 i=l 

where p = Wtt and 1 /;, £ is the vth row of the whitening matrix W. A direct implementation 
is to sketch each of the low-rank components in Eq. ( |25j ) and compute their sum. Since there are 
0(m d ) tensors, building the sketch of T,/ requires O(m d ) FFTs, which is unsatisfactory. However, 
note that {wi}Y = i are fixed and shared across documents. So when scanning the documents we 
maintain the sum of n t and riip and add the incremental after all documents are scanned. In this 
way, we only need 0(1) FFT per document with an additional 0(V) FFTs. Since the total number 
of documents D is usually much larger than V, this provides significant speed-ups over the naive 
method that sketches each term in Eq. ( |25| independently. As a result, the sketch of T can be 
computed in 0(kYY2 d m d ) + (D + V)b\ogb) operations, which is much more efficient than the 
0(fc 2 (£ d 771 d) + Dk 3 ) brute-force computation. 

We next turn to the term E[a:i ® x% ® Mi](W, W, W). Fix a document d and let p = W n d . 
Define q = WMj. By definition, the whitened empirical moment can be decomposed as 

V 

E[tci ® x 2 €> Mi](W, W, W) = ^2 n iP ® P ® (26) 

i=l 

Note that Eq. (|26|) is very similar to Eq. ([25]). Consequently, we can apply the same trick (i.e., 
adding p and riip up before doing sketching or FFT) to compute Eq. (|26[ efficiently. 


7 and also E[xi ® Mi ® £C2](W, W, W), E[Mi ® x\ ® a:2](W, W, W) by symmetry. 
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Appendix E Proofs 

E.l Proofs of some technical propositions 


Proof of Proposition^ We prove the proposition for the case q = 2 (i.e., H is 2-wise independent). 
This suffices for our purpose in this paper and generalization to q > 2 cases is straightforward. For 
notational simplicity we omit all modulo operators. Consider two p-tuples l = (l lt ■ ■ ■ , If) and 
l' = (/'. • • • , l' p ) such that l f l!. Since H is permutation invariant, we assume without loss of 
generality that for some s < p and 1 < i < s we have Z ?: = If Fix t, t' £ [6], We then have 

Pr[H{l)=tAH{l')=t']=Y J E Pr[/i(Zi) + • • ■ + h(l s ) = a] 

& h(l± ) - l - ‘ ■ ■ -\-h(l s 

E Pr[/i(Z s+ i) = ri A • • • A h{l p ) = r p A h(l' s+1 ) = r[ A • •• A h(l' p ) = r' p \. (27) 

r s _)_iH-1 -r v =t—a 

r's+i+~+r' v =t'-a 

Since h is 2p-wise independent, we have 

Pr[/i(Zi)4- \-h.(l s )=a]= E Pr[/i(Zi) = ri A ••• h(l s ) = r s ] = 6 S_1 • ^ = p 

riH- \-r s =a 


E Pr[ft(Z s +i) = n A • • • A h(l p ) = r p A h(l' s+1 ) = r[ A • • • A h{l' p ) = r ' p ] 

r s _)_iH- \-r p =t—a 

<+1-1- \-r' p =t-a 


= +2(p—s—1) . 1 = A 

b 2 (p~ s ) b 2 ' 

Summing everything up we get Pr [H(l) = t/\H(l') = t'] = 1/6 2 , which is to be demonstrated. □ 


Proof of Proposition^ Since both FFT and inverse FFT preserve inner products, we have 

(st, Si,„ * s 2 ,u * S 3 , ei ) = (J'(st), O T(s 2 , u ) oF(s 3>ei )) 

= (^(«t) 0-7 r (si,«) 0-7 r (s2, U ),-7 r (s3,eJ) 

= (J' 1 (J(s T ) °F(si, u ) o J 7 (S2,u)), S 3 , ei ). 

□ 


E.2 Analysis of tensor sketch approximation error 


Proofs of Theorem [I] is based on the following two key lemmas, which states that («a, sg) is a 
consistent estimator of the true inner product (A, B); furthermore, the variance of the estimator 
decays linearly with the hash length b. The lemmas are interesting in their own right, providing 
useful tools for proving approximation accuracy in a wide range of applications when colliding hash 
and symmetric sketches are used. 

Lemma 1. Suppose A, B £ (f) p R" are two symmetric real tensors and let sa, sg £ C b be the 
symmetric tensor sketches of A and B. That is, 


SA{t) 

H(i l,-- ,i P )=t 

(28) 


= ^ ' ' ‘ •T’ipBii,— ■ 

,i p )=t 

(29) 


Assume H{i\, ■ ■ ■ , if = ( h(i\) + ■ ■ ■ + h(i p )) mod b are drawn from a 2-wise independent hash 
family. Then the following holds: 


^h,cr [(^A? 

= (A,B), 

(30) 

Vh,o- [(5a, 5g)] 

z 4 ? A p B ^ 
b 

(31) 
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Lemma 2. Following notations and assumptions in Lemmauj Let {A ,:}™ j and {B,}™ 1 be sym¬ 
metric real n X n x n tensors and fix real vector w £ Then we have 


E 


'Ew i W j (8 Ai ,8Z.) 

'£ l WiW J (8 Ai ,8 i i) 








< 


4 p ||m|| 4 (max i ||Ai|||)(maxj ||B;||^) 


(32) 

(33) 


Proof of Lemma^ 7] We first define some notations. Let l = (Zi, • • • ,l p ) £ [d] p be ap-tuple denoting 
a multi-index. Define A i := Ai lr .. ^ and a(l) := ■ ■ ■ ai . For l, l' £ [n] p , define 6(1, l 1 ) = 1 

if h(li) + • • • + h(l p ) = h(l[) + • • • + h(l' p )( mod b) and 6(1, l') = 0 otherwise. For a p-tuple 
l £ [n] p , let C(l) £ [n] p denote the p-tuple obtained by re-ordering indices in l in ascending 
order. Let M(l) £ N b denote the “expanded version” of l. That is, \A4(I,)}, denote the number of 
occurrences of the index i in l. By definition, ||A4(Z)||i = p. Finally, by definition B i> = Bj' if 
l' = C(l') and Bj' = 0 otherwise. 


Eq. ( |30} is easy to prove. By definition and linearity of expectation we have 

![<Sa, Sg>] = E S & v ■ 


i.v 


Note that 6 and a are independent and 


E \a(l)o(l')] = / )i 

a '- ^ ^ 1 0, otherwise. 


(34) 


(35) 


Also 6(1, l') = 1 with probability 1 whenever C(l) = C(l'). Note that Bp = 0 whenever l’ fi=- C(l'). 
Consequently, 

E[(5 A ,5g)]= ]T A,B £(I) = (A,B). (36) 

le[n]P 


For the variance, we have the following expression for E[(5a, Sg) 2 ]: 

E[(s A , «g) 2 ] = ]T E[^,r)5(r,r')]-E[a(0d(r) f f(r)a(r')]-A i A T .B r B w (37) 

=: E E [t(l,l',r,r')]. (38) 

1,1' ,r,r' 

We remark that E[c 7 (Z)< 7 (Z , )( 7 (r)( 7 (T• , )] = 0 if M.(l) — A i(l') ^ A i(r) — Ad(r'). In the remainder 
of the proof we will assume that A 4(1) — A 4(1') = A4(r) — A 't(r'). This can be further categorized 
into two cases: 

Case 1: l! = C(l) and r' = £(r). By definition ¥.[o(l)d(l')<j(r)d (r')] = 1 and 

E[<5(Z, l')6(r, r')] = 1. Subsequently E[t(Z, l' , r , r')] = AiA r BcB r > and hence 

E E [t(U\r,r')\=J2 A i A r^r = (A,B) 2 . (39) 

l,r,l'=C(l),r'=C(r) l,r 

Case 2: l 1 C(l) or r' £(r). Since M(l) — M(l') = M(r) — M(r') / 0 we 

have E[<$(Z, l')6(r, r')] = 1/b because ft, is a 2-wise independent hash function. In addition, 
E[|<r(Z)(f(Z , )cr(r)a(r , )|] < 1. 

To enumerate all (Z, l', r, r') tuples that satisfy the colliding condition M(l) — Ai(l') = A4(r) - 
Ad(r') ^ 0, we fix[^]||A / l(Z) — Ad(Z , )||i = 2 q and fix q positions each in Z and r (for l' and r' the 
positions of these indices are automatically fixed because indices in l' and r' must be in ascending 

x Note that sum(A4 (Z)) = sum(Af(Z / )) and hence ||Ad(Z) — ■M(Z') ||i must be even. Furthermore, the sum 
of positive entries in ( M(l) — M(l')) equals the sum of negative entries. 
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order). Without loss of generality assume the fixed q positions for both l and r are the first q indices. 
The 4-tuple (l, r, l !, r') with \\M. (Z) — A4(l’)\\i = 2 q can then be enumerated as follows: 

Y t(l,l',r,r') 

,r' 

M(l)-M(l')=M(r)-M(r') 

\\M(l)-M(l')\\ 1 =2q 

= Y Y Y *( io *’£(•?' °l),i°r,£(j or)) 

ie[n]i j£[n]i le[n] p ~ q 
r£[n] p - q 


1 


^ ^ ^ A^oZ A^o^Bjo/B 


Jo r 


i,3'e[n] 

l,r£[n] p - q 


i ^ i Ij' ' ' > I)i‘ i-O) 






|A|||||B!|| 


(40) 


Here o denotes concatenation, that is, i o l = (i l5 • • ■ , i q: Zi, • • • , l p - q ) £ [n] p . The fourth equation 
is Cauchy-Schwartz inequality. Finally note that there are no more than 4 P ways of assigning q 
positions to l and l 1 each. Combining Eq. (39 1 and (40 1 we get 


V[<Sa,S g>] = E[<S A) Sg) 2 ] - (A,B) 2 < 4 P||A|l f l|B|1 ^ , 


which completes the proof. 


□ 


Proof of Lemma^ Eq. ( [32| immediately follows Eq. ( |28| > by adding everything together. For the 
variance bound we cannot use the same argument because in general the m 2 random variables are 
neither independent nor uncorrelated. Instead, we compute the variance by definition. First we 
compute the expected square term as follows: 


E 


Y^jiSA,,^) 


i,3 


Y WiWjWi'Wj> -E [6(l,l')6(r,r')] • E[cr(Z)er(O cr ( r ) cr ( r ')] • [A i ]i[A i /] T .[Bj] I /[Bj/] I 


1,3,1 ,3 

l,l r ,r,r 


(41) 


Define X = i w,A, and Y = i/',B,;. The above equation can then be simplified ; 


E 


Y w i w j(SAi,S^.) 

ij j 

Applying Lemma [T] we have 


= ^ E[^,r)5(r,r')] ■E[ ( T(^)a(r) ( T(r) ( T(r , )] •XjX.YrY^. 

(42) 


IV 


Y W i W j^ A^Sg } 


i,3 


< 


4 P ||X|| 2 ||Y" 2 


FII IIF 


Finally, note that 

||X|||. = Yj WiW i (A*' Aj) < Y w t' u T'l|Ai||F||A j || F < || to || 2 max || A 


i IlF- 


(43) 


(44) 
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□ 


With Lemma [T] and [2j we can easily prove Theorem[T] 


Proof of Theorem^ First we prove the e\ (u) bound. LetA = TandB = m® 3 . Note that || A|| F = 
||T||f and ||B||.f = ||m|| 2 = 1. Note that [T(I, m, m)]* = T(e*, m, u). Next we consider e 2 (it) and 
let A = T, B = e, <g> u <g> u. Again we have |j A||jr = ||T||i? and ||B||^ = 1. A union bound over 
all i = 1, • • • , n yields the result. For the inequality involving w we apply Lemma[2] □ 


E.3 Analysis of fast robust tensor power method 


In this section, we prove Theorem[3] a more refined version of Theorem[2]in Section 4.2 We struc¬ 
ture the section by first demonstrating the convergence behavior of noisy tensor power method, and 
then show how error accumulates with deflation. Finally, the overall bound is derived by combining 
these two parts. 


E.3.1 Recovering the principal eigenvector 

Define the angle between two vectors v and u to be 9 (v. u). First, in Lemma [ 3 ] we show that if 
the initialization vector m 0 is randomly chosen from the unit sphere, then the angle 9 between the 
iteratively updated vector u t and the largest eigenvector of tensor T, V \. will decrease to a point 
that tan 9 (v\, u t ) < 1. Afterwards, in LemmaBl we use a similar approach as in [35] to prove that 
the error between the final estimation and the ground truth is bounded. 

Suppose T is the exact low-rank ground truth tensor and Each noisy tensor update can then be 
written as 

ut+i = T(I,u t ,u t ) + e(u t ), (45) 

where i(u t ) = E(I ,u t ,Ut) + £ 2 is the noise coming from statistical and tensor sketch 
approximation error. 

Before presenting key lemmas, we first define 7 -separation, a concept introduced in JT]. 

Definition 1 ( 7 -separation, (H). Fix i* £ [k], u £ R n and 7 > 0. u is 7 -separated with respect to 
if the following holds: 

\i*(u,Vi*)~ max Xi(u,Vi) >'yX i *(u,v i *). (46) 

ie[k]\{i*} 


Lemma[3]analyzes the first phase of the noisy tensor power algorithm. It shows that if the initializa¬ 
tion vector tto is 7 -separated with respect to V \ and the magnitude of noise i(u t ) is small at each 
iteration t, then after a short number of iterations we will have inner product between u t and V \ at 
least a constant. 


Lemma 3. Let {ui, i? 2 , • • • , vand { Ai , A 2 , • • ■ , A^} be eigenvectors and eigenvalues of tensor 

T £ K nxnx ", where X\ |(ui, Mo) | = max Ai |(ui, Mo)|. Denote V = (ui, • • • , vf) £ R” xfc as the 

te[fc] 

matrix for eigenvectors. Suppose that for every iteration t the noise satisfies 

|(uj,£(M t ))| < ei Vie [n] and ||V T e(Mt)|| < e 2 ; (47) 

suppose also the initialization Uq is 7- separated with respect to 17 for some 7 £ (0.5,1). If 
tan#(ui,Mo) > 1, and 


Cl < min 




+ 2 


-- + a )/ 2 j an d e 2 < 1 -^-^^ Ai |(mi,m 0 )| 


2 V / 2(1 + a) 


(48) 


for some a > 0, then for a small constant p > 0, there exists a T > log 1+a (1 + p) tan 9 (iq, Uq) 
such that after T iteration, we have tan 9 (ui, Ut) < 77 ^) 
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Proof. Let u t +i = T (I ,u t ,u t ) + e(u t ) and u t +1 = u t + 1 / ||€tt+i|| . For a £ (0,1), we try to 
prove that there exists a T such that for t > T 

1 |(t>i,M t+ i)| |(vi,W t+ i)| 


tan 0 (ui,ttt+i) 


(l-^r.tit+r ) 2 ) 172 


1/2 - 


> 1. 


First we examine the numerator. Using the assumption I ( 17 , s(u t )} I < ei and the fact that 

(vi,u t+ 1 } = Aj ( Vi,u t ) 2 + (ui,e(« t )), we have 

|(ui,«t +1 )| > A* ( Vi,u t } 2 - ei > |(wi,«t)| (A* Kwi.tit)! - ei/ |(w», it t )|). 

For the denominator, by Holders inequality we have 

1/2 


(50) 


^ (Ui, Ut+lfj = ( Ai ^ M *) 2 + <++ £ 


1/2 


1/2 


1/2 




Vi=2 


V i—2 


1/2 


Equation < [50] > and ( |5 1 [ 1 yield 

1 \{vi,u t 


tan# (vi,u t +i) 


> 


< max A* |(ui,« t )| + e 2 

< (l - (ui,u t ) 2 ) ^inaxAi |(wi,« t )| + e 2 / (l - (■Ui,M f ) 2 ) 

Ai |(vi,« t )| - ei/ |(vi,itt)| 


(51) 

(52) 

(53) 


1/2 


(l - {vi,u t ) 2 ') 1 max.\ i \(v i ,u t )\+e 2 / (l - (vi,« t ) 2 ) 
1 Ai |(vi,M t )|-ei/|(t>i,M t )| 


1/2 


tan# (vi,u t ) 


max A j |(wi,u t )| + e 2 / (l - (t+,M t ) 2 ) 
\ / 


1/2 


(54) 

(55) 

(56) 


To prove that the second term is larger than 1 + a, we first show that when t = 0, the inequality 
holds. Since the initialization vector is a 7 —separated vector, we have 


Ai |(ui,«o)l - maxA, ; |(t> l ,u 0 )| > 7 A 1 |(vi,m 0 )| , 

i£[k\ 


(57) 


max A, \(vi,u 0 }\ < (1 - 7 )Ai |(t>i,M 0 )| < 0.5Ai |(ui,uo)| , (58) 

»e[fc] 

the last inequality holds since 7 > 0.5. Note that we assume tan# (ui, uq) > 1 and hence 
(r?i,Mo ) 2 < 0.5. Therefore, 


1 — (1 + a )/2 ,, 

E,£ 2 V 2 (l + a) A ‘ l( ” , ’" o)l£ 


/ ,\t/2 

(l-(l + a)/ 2 ) 


2(1 +a) 

Thus, for / = 0. using the condition for ei and e 2 we have 

Ai |(t+«o)| - Cj/ | (vj, up) | _ Ai |(t> ? ;,Mo)l - ei/l(T+,«o)l 


Ai |(u 1 ,M 0 )| • (59) 


' ' 1 | \ v X 5 / | 1 / I \ v l) / | ^ 'U |\ W 15 / | 1 / | \ w X 5 U / | 

max Ai |(ui,«o)| + e 2 / (l - (t>i,M 0 ) 2 ) 0.5Ai |(vi,w 0 )| + e 2 / (l - (v 1 ,u 0 ) 2 ^j 


1/2 


> 1 + a. 


(60) 


The result yields 1/tan# (i?i, Mi) > (1 + a)/tan# (iti, uq) • This also indicates that |(ui,ui)| > 
| (y \, Uq) I , which implies that 

. ( 1 

Ci < mm 


^ Ai ~ 


1 —(l + a)/2^ \2 ^ ^ 1 — (1 + a)/2 |/ V | 

--- I Ai (ui, ii t ) and 62 - 2 v^(l + a) Al t,1 ’ TA * 


( 61 ) 
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(62) 


also holds for t = 1. Next we need to make sure that tor t > 0 

max A* |(uj,u t )| < 0.5Ai |(ui,u t )|. 
1 


In other words, we need to show that 


Al|(«l,1tf)| 
max Ai| (vi ,Ut) \ 


> 2. From Equation ( |58j ), for t = 0, 


_ Ai I{^i - u t )I _ _l_ o Pqi* everv ?" c— [FI 

maxAiK«i,«t)l — 1-7 — y H > 


|(wi,«t+i)| < A i \(v l ,u t )\ 2 + e 1 < |(ui,« t )| (Aj \(vi,u t )\ + ei/ |<v», tt t )|) • 
With equation ( |50l ), we have 

Ai |(vi,w t+ i)| __ Ai |(vi,« t+ i)| Al l( v i> M t)l ( A i \( v i: u t}\ - 


'' ’— ‘lllll = ' 1 ' 1 > -a- 

A* \{vi, M t+ i)| Aj \(vi, «t+i)| Aj |(wi, u t )| ^Aj \(vi, u t 

Ai |('»i,it t )| 

A,: \{v %1 u t )\ 

Ai \(vi,u t }\ 

Ai \(vi,u t )\ 




1 - 


Ai(i>i ,u t ) 


\i_ £ i ( Ai |(r)i,Mf)| A ' 

^1 Ai<1)i ,U t ) 2 \ Ai|<Di,«t)| ) 


(63) 


(64) 


(65) 


1 - 


Ai{«i ,ut)' 2 


max Aj / 

I ^€[ fc ] _ei_ / Ai|(i>i, 

Ai A 1 (v 1 ,u t ) 2 \ A* j (vi,- 


u t) I 

“t)| 


1 - 


Ai (vi,u t ) 2 


Ai l<t>i.«t>l 

*d(«i.“t>l 


max ie[fc] A 


( 66 ) 

(67) 


Ai {v\,ut) 2 


Let k = 


—. For t = 0, with conditions on f \ the following holds: 


Aj |(fl,Mi) 

Ai \{vi,ui) 


1 - 


> 


\ 1 (v 1 ,u 0 ) 2 


A i I ( v i- u n) 

AilK.-uo)! 


ie W ■ 


Ai Ai (ui ,uq) 2 


1 - 


> 1 


4/^+2 


= 2 


4 ' 4k+2 

With the two conditions stated in Equation following the same step in ( |60| i, we 
;—-r > (1 + a) ;—wf -_ By induction, r— 57 ^-- > (1 + a) 7 — jk — 77 . for t 

tana(vi,it 2 ) — ' ' tan 6{v\ ,ui) J ’ tan 0{vi ,ut+i) — A / tan0(vi ,£) 

Subsequently, 

1 > (1 + a) T 1 


tan 9 {v 1 ,ut) tan 9(vi,u 0 )' 

Finally, we complete the proof by setting T > log 1+Q , (1 + p) tan 9 (it!, u 0 ). 


( 68 ) 


(69) 

have 
> 0. 


(70) 

□ 


Next, we present Lemma[4j which analyzes the second phase of the noisy tensor power method. The 
second phase starts with tan 9(v 1 , u 0 ) < 1, that is, the inner product of and u 0 is lower bounded 
by 1 / 2 . 

Lemma 4. Let V\ be the principal eigenvector of a tensor T and let Uq be an arbitrary vector in 
that satisfies tan0(i>i, u 0 ) < 1. Suppose at every iteration t the noise satisfies 

4||e(« t )|| < e(Ai - A 2 ) and 4| (vx, e(u t ))| < (Ai - A 2 ) cos 2 9 (iti, u 0 ) (71) 

for some e < 1. Then with high probability there exists T = O ^^ 37 ^ log(l/e)^ such that after T 
iteration we have tan 9 (ui, ut) < e. 


Proof Define A := 


Al 4 Aa and X := v , . We have the following chain of inequalities: 
'X T (T (I,u,u)+s(«))|| 


tan0 (t?i, T (I, u, u) + e(u)) < 


\v\ (T (I ,u,u) + e(u) 


(72) 
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< 


< 


|x t t (i, it, m) || + ||v t £(m)| 

||ffT (I, m, m)|| — ||ffe(M)|| 

A 2 ||X t m||- + ||e(tt)|| 

Ai Iff m| 2 — Iff e(m)| 


X t m 


a 2 


|f f u | 2 \ _ K g H , 

1 1 1 Al Ai - 




< tan 2 #(f i, u) 

< max e 


A 2 Ae (l + tan 2 # (f i, m)) 
A 2 + 3A A 2 + 3A 

A 2 + Ae 2 


’ A 2 + 2A 
A 2 + Ae 
A 2 + 2A 

The second step follows by triangle inequality. For u = u 0 , using the condition tan (f 1; u 0 ) < 1 


< max e 


■ tan 2 # (f i, u) 
tan# (f i, m) 


(73) 

(74) 

(75) 

(76) 

(77) 

(78) 


A 2 -{- Ae ^ 9 /i / \\ ( A 2 T Ae 


A 2+2 A tan ^(^«)) < max ( .e, + tan 9 (f i, m) ) (79) 


we obtain 

tan#(f 1 ,M 1 ) < max |^e, 

Since ^ max (XX+A’ e ) ^ (A 2 /Ai) 1/4 < 1, we have 

tan# (f i, Mi) = tan# (f i, T (I, uq, m 0 ) + e(u t )) < max ^e, (A 2 /Ai) 1//4 tan# (f i, m 0 )^ < 1. 

(80) 

By induction, 

tan# (f i, M t+ i) = tan# (v i,T (I, Mj, u t ) + s(u t )) < max ^e, (A 2 /Ai) 1 ^ 4 tan# (f i, M t )^j < 1. 

for every t. Eq. ( |78[ i then yields 

tan# (f i, ut) < max ^e, maxe, (A 2 /Ai) L ^ 4 tan# (f i, m 0 )^ . (81) 

Consequently, after T = log^/^-j-i/^l/e) iterations we have tan# (f i, ut) < e. □ 

Lemma 5. Suppose fi is the principal eigenvector of a tensor T and let Uq € K”. For some 
a, p > 0 and e < 1, if at every step, the noise satisfies 


|e(M t )|| < e ~—r~ an d |(fl,£(«*))! < min 


1 


^ l-(f + a)/2 x \ 1 


4"^\ gW Ai + 2 > 2 v / 2(1 + a) 

Ai 


(82) 


then with high probability there exists an T = O ^logi +a (1 + p) Ty/n + log(l/e)J such 

that after T iterations we have || (/ — UtU f ) f 1 1| < e. 


Proof By Lemma 2.5 in [35], for any fixed orthonormal matrix V and a random vector m, we 
have maxjgrjf] tan#(f i; u) < Ts/n with all but 0(r _1 + e~ n ^) probability. Using the fact that 
cos 9 (f i, m 0 ) > 1/(1 + tan # (f i, m 0 )) > the following bounds on the noise level imply the 
conditions in Lemma [2 


rT~/„. \U ^ 1 - (1 + a )/2 


V J e(M t ) < 


2\/2(l + oi)T%Jn 


and |(fi,e(M t ))| 


< min 


i 1 — (1 + oi)/2 ^ 1 Wi 

; Al ’- 9 - Al v< - 
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Note that | (ui, e(u t ))| < Aiimplies the first bound in Eq. (83 i. In Lemmajdj we 

assume tand (t>i, uq) < 1 and prove that for every u tl tan# (v\, uf) < 1, which is equivalent 
to saying that at every step, cos 0 ( v-\. uf) > By plugging the inequality into the second con¬ 
dition in Lemma 4 we have | (ui, e(u t )) \ < —The lemma then follows by the fact that 
|| (J — utUt t ) iti]| = sin 0 ( ut, v\) < tan 6 (ut , vf) < e. □ 


E.3.2 Deflation 

In previous sections we have upper bounded the Euclidean distance between the estimated and the 
true principal eigenvector of an input tensor T. In this section, we show that error introduced from 
previous tensor power updates can also be bounded. As a result, we obtain error bounds between 
the entire set of base vectors {vi}f =1 and their estimation {ui}* =1 . 

Lemma 6. Let {v\, • • • , v &} and {Ai, A 2 , • • • , Xk} be orthonormal eigenvectors and eigenval¬ 
ues of an input tensor T. Define A max := max i6 [ fc ] Ai. Suppose and are estimated 

eigenvector/eigenvalue pairs. Fix e > 0 and any t £ [fc]. If 

|Aj — A*| < Aie/2, and \\iii — Ui\\ < e (83) 

for all i £ [t], then for any unit vector u the following holds: 

2 


£ K 3 


- A,;t>® 


03 


(I ,u,u) 


2 2 


<4 (2.5A max H - (^max H - 1-5)6) 6 


9(1 + e/2) 2 Xj 


+ 8(1 + e/2) 2 A 

<50AL x £ 2 - 


2 \ 2 2 

max c 


(84) 

(85) 

( 86 ) 


Proof. Following similar approaches in jT), Lemma B.5, we define v = v, (vj vf)Vi and 
D, = Xvf 3 - A ivf 3 . D i(I, u. u) can then be written as the sum of scaled v, and vj products 
as follows: 

D i (I, M, u) =A i(u T Vi) 2 Vi - Xi(u T Vi) 2 Vi (87) 

=X z (u r v i ) 2 v l - Ai(w T (vj- + (vjvfv^) 2 (v 1 - + (vjvi)vi^j (88) 

= ((Ai - Xi( v J ^i) 3 ) ( u T Vi) 2 - 2X i (u T v^)(vJi> i ) 2 (u T v i ) - Xi(vjVi)(u T i> ± ) S J i 


- Ai 


((u T Vi)(vJVi) + u T v^ (vf / vf ) 


(89) 


Suppose Ai and If are coefficients of v, and ( vf / 
be bounded as 


, respectively. The summation of D, can 




2—1 


2=1 


2=1 


( --L / 

-_L 

\ 



) 



t 

2 

t 



<2 

2=1 

+ 2 

J2 B ‘ (*<■/ 
2=1 

vj- 

) 


We then try to upper bound | Ai \. 


< E A ‘+ 2 


2=1 


y 2=1 


\Ai\ < (Ai - Xi(vji>i ) 3 ) ( u T Vi ) 2 - 2Xi(u T vj-)(vJvi) 2 (u T Vi) - Xi(vj v i )(u T v J -) 


(90) 


< (A* 11 - {vjVi) 3 1 + Ai - Ai (vjVi) 3 ') (u T Vi) 2 + 2 (a* + Ai - Ai ) 


Vi -Vi u vf 


+ (Ai + 


^2 ^7 


) II Vi - Vi 


(91) 
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< 


(l-5||ui - Vi 


+ (Aj + 


Ai Ai 


A,-A 

) ||«i - Vi 


(a* + A* - A* ^ ||ui - Will) |« T Wi 


+ 2 (Aj + 
2 


< (2.5Ai + (Ai + 1.5)e) e + (1 + e/2)Ai£ 2 

Next, we bound \Bj | in a similar manner. 


\Bi\ = 
<2 


Ai 


it T Wi) 07 wi) + u' i)i 


Til 


^Ai + Ai — Ai ) vf ^(tt T Wi ) 2 + w 4 ~ ^ 


<2(1 -t- g/2)A ic{u Vi) + 2(1 + e/2)A 


(92) 

(93) 

(94) 

(95) 

(96) 

0 


Combining everything together we have 


^D,; (!,«,«) 


2=1 


< 2 ^A 2 + 2 ^|^| (97) 

2=1 \ 2=1 / 
t 

< 53 4 (5Aj + (A, + 1.5 )) 2 e 2 |u T Wif + 4(1 + e/2) 2 A 2 e 4 
2=1 

+ 2 ^ 2(1 + e/2)A,e(tx T w l ) 2 + 2(1 + e/ 2 )A,e 3 j (98) 

t 

<4 (2.5A max + (A max + 1.5)e) e 2 ^ ' |tx T Wj | + 4(1 + e/2) 2 A^j ax e 4 

i=i 

+ 2 (2(1+ e/2)A max e ^(u T Wi) 2 + 2(1 + e/2)A max e 3 ] (99) 


<4 (2.5A max + (A max + 1.5)e) e 2 + 9(1 + e/2) 2 A I ^ lax e 4 + 8(1 + e/2) 2 A^ lax e 2 

( 100 ) 


E.3.3 Main Theorem 

In this section we present and prove the main theorem that bounds the reconstruction error of fast 
robust tensor power method under appropriate settings of the hash length b and number of indepen¬ 
dent hashes B. The theorem presented below is a more detailed version of Theorem [2] presented in 
Section l4~2l 

Theorem 3. Let T = T + E € Rnxnxn^ w f lere ^ = A tvf 3 and {wi }^ =1 is an orthonormal 

basis. Suppose (f>i, Ai), (wi, Ai), • • • (vk- A k) is the sequence of estimated eigenvector/eigenvalue 
pairs obtained using the fast robust tensor power method. Assume ||E|| = e. There exists constant 
C i, C 2 , C 3 , a,p,r> 0 such that the following holds: if 

£ < Ci — 7 ——, and T = C 2 flog 1+a (1 + p) Tyfn + Al log(l/e)) , (101) 

^Vax \ Ai — A 2 J 

and 

/ ln(L/log 2 (fc/ 7 ? )) _ / _ In (In Lj log 2 (fc/ 77 )) + C 3 _ / ln( 8 ) \ / / ln(4) 

V ln ( fc ) V 41n(L/log 2 (&/»7)) V H L /^og 2 {k/p))J ~ ' ^ y ln(fc) 

( 102 ) 
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Suppose the tensor sketch randomness is independent among all tensor product evaluations. If 
B = f2(log(n/r)) and the hash length b is set to 


b> 


ml r 4 n 2 


16e- 2 ||T|||. 


^- 2 ||T|| f 


'_i_x l—(i+oQ /2 , \ 2 ’ min i6[fc ] (A* - Ai_i ) 2 ’ 

^4max ie[fc] (A;/Ai)+2 A1 ’ 2y/2(l +a) Al J 

with probability at least 1 — (77 + r -1 + e~ n ), there exists a permutation tt on k such that 

k 

I - A*r(t\€ 1 

|^7r(j) || — ^5 


^n(J) A? 


< — . and 

~ 2 ’ 


T-£***. 

j =1 


< ce, 


(103) 


(104) 


/or some absolute constant c. 


Proof. We prove that at the end of each iteration i £ [A:], the following conditions hold 


• 1. For all j < i, — Vj \ < e and 

• 2. The tensor error satisfies 




< Ail 
— 2 


-1 

"hP' 

M 

v> 

< 0 . 

CS> 

co 

^U) v t(j) 

(I ,u,u) 

L V / 

j>i +1 



< 56e 


(105) 


First, we check the case when i = 0. For the tensor error, we have 

K 

T - K(j)vtlj) (I ,u,u) = ||e(u)|| < ||e 2 ,T(w)|| + ||E (I, m, u)|| < e + e = 2e. (106) 
j = 1 

The last inequality follows Theorem [T] with the condition for b. Next, Using Lemma[5] we have that 

||tV(i) - «i|| < £• (107) 

In addition, conditions for hash length b and Theorem [I] yield 


A 7r (i) - Ai < ||ei,r(f 1 )|| + ||T(ifi - v ll v\ - u,v\ - ui)|| < 


; A^A I=i +<;J || T ||^ £ eA, 

(108) 

Thus, we have proved that for i = 1 both conditions hold. Assume the conditions hold up to j = t —1 
by induction. For the fth iteration, the following holds: 


- 1 

M 

v> 

<0. 

^■<0 

CO_^ 


(I, It, It) 

L V / 

j>t + 1 




K 



t 

< 

T 

(I ,U,u) 

+ 

2_^ A l V j *n(j) V ir(j) 





J =1 


< e + a/50A h 


For the last inequality we apply Lemma[6] Since the condition is satisfied, Lemma[5]yields 

||*Mt+i) -ut+i|| < e- (109) 

Finally, conditions for hash length b and Theorem [I] yield 


T(t+1) — A t + i 


< ||£i,t(«i)|| + ||T(F t - V!,Vi - u,v 1 - v ± ) 


e\i 


<e Ai 4 A<-1 +e 3 ||T|l F < (110) 


□ 
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Appendix F Summary of notations for matrix/vector products 


We assume vectors a,b £ C n are indexed starting from 0; that is, a = (ao, cti, • • • , a n _i) and 
b = (bo, bi, ■ ■ ■ , 6 n _i). Matrices A, B and tensors T are still indexed starting from 1. 

Element-wise product For a,b £ C”, the element-wise product (Hadamard product) a o b £ W 
is defined as 

ao b = (a 0 f> 0 ) aibi, ■ ■ ■ , a„_i6„_i). (Ill) 

Convolution For a,b £ C”, their convolution a* b £ C” is defined as 

'y ' a%bj, y ' aibj , •••, y 

( i-\-j ) mod n —0 (z+i) mod n=l (i+j) mod n 

Inner product For a, b £ C", their inner product is defined as 

n 

(a,b) = ^aA, (113) 

i =1 

where bi denotes the complex conjugate of bi. For tensors A,B £ C nxnxn , their inner product is 
defined similarly as 

n 

(A, B) = ^ ' Aj^fcBjj^. (114) 

i,j,k=l 





( 112 ) 


=n—1 


Tensor product For a,b £ C", the tensor product a 0 b can he either an n x n matrix or a vector 
of length n 2 . For the former case, we have 



a 0 b 0 

a 0 bi 

aobn-i 


a © 6 = 

aib 0 

aih 

a\b n -\ 

(115) 


. CL„.-ibo 

dn-lbl 

* &n—\b n — 1 _ 



If a © 6 is a vector, it is defined as the expansion of the output matrix. That is. 


a(8>6= (a 0 6o) a 0 bi, ■ ■ ■ , a 0 6„_i, ai6 0 ,ai6i, • • • 


(116) 


Suppose T is an n x n x n tensor and matrices A £ M” xmi , B £ R nxm2 anc | (j g R nxm 3. -p^g 
tensor product T(A, B, C) is an mi x m2 x m 3 tensor defined by 

n 

[T(A,B,C)] iJife = ^ T,A,..,B ; (117) 

i' ,j' ,k '=1 


Khatri-Rao product For A, B £ <C nxm , their Khatri-Rao product A©B £ C" 2xm is defined as 

A © B = (A(i) © B(i), A( 2 ) © B(2), • • • , A( m ) ® B( m )), (118) 

where and B,,) denote the ith rows of A and B. 

Mode expansion For a tensor T of dimension n x n x n, its first mode expansion T (1 ) £ W lxn 
is defined as 


r t 1>m 

T2,l,l 

Ti,i,2 ’• 
T2,l,2 ’ 

T 2 ,i,„ 

Ti,2,1 

T2,2,l 

. t 1 1 

- 1 - 1 ,n,n 

T2, n ,n 

(119) 

. T n,l,l 

T„,l,2 

Tn,l,n 

T n , 2> l 

T 

-■-n,n,n J 



The mode expansions T( 2 ) and T( 3 ) can be similarly defined. 
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